Spark is_valid_utf8 function implementation#21627

Merged
Jefffrey merged 8 commits intoapache:mainfrom
kazantsev-maksim:is_valid_utf8
Apr 19, 2026

Conversation

@kazantsev-maksim
Contributor

Which issue does this PR close?

N/A

Rationale for this change

Add new spark function: https://spark.apache.org/docs/latest/api/sql/index.html#is_valid_utf8

What changes are included in this PR?

  • Implementation
  • SLT tests

Are these changes tested?

Yes, tests added as part of this PR.

Are there any user-facing changes?

No, this is a new function.

@kazantsev-maksim kazantsev-maksim marked this pull request as draft April 14, 2026 16:38
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Apr 14, 2026
@kazantsev-maksim kazantsev-maksim marked this pull request as ready for review April 16, 2026 16:37
```rust
fn spark_is_valid_utf8_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
    let [array] = take_function_args("is_valid_utf8", args)?;
    match array.data_type() {
        DataType::Utf8 => Ok(Arc::new(
```
Contributor
Utf8 types should always be valid, unless there's an edge case I'm missing? We'd just need to return a BooleanArray with all true values, and take the null buffer from the input array.

Contributor Author

Good point! Thanks, fixed.
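The reviewer's point can be illustrated outside of arrow-rs. The sketch below is hypothetical (plain Rust, not the PR's code): it models a nullable string column as `Vec<Option<&str>>`. Since a value typed as a Rust string is valid UTF-8 by construction, `is_valid_utf8` only needs to carry the null mask through and answer true everywhere else.

```rust
// Hypothetical sketch, not arrow-rs code: a column whose type already
// guarantees valid UTF-8 needs no per-byte validation. We only propagate
// nulls and return true for every non-null entry.
fn is_valid_utf8_for_string_column(col: &[Option<&str>]) -> Vec<Option<bool>> {
    col.iter().map(|v| v.map(|_| true)).collect()
}

fn main() {
    let col = vec![Some("Spark"), None, Some("")];
    assert_eq!(
        is_valid_utf8_for_string_column(&col),
        vec![Some(true), None, Some(true)]
    );
    println!("ok");
}
```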

# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

Contributor

Would be a good idea to have a null test too, as well as an array input test (currently these are all scalars).

```rust
DataType::Binary => Ok(Arc::new(
    as_binary_array(array)?
        .iter()
        .map(|x| x.map(|y| String::from_utf8(y.into()).is_ok()))
```
Contributor

Suggested change

```diff
-.map(|x| x.map(|y| String::from_utf8(y.into()).is_ok()))
+.map(|x| x.map(|y| str::from_utf8(y).is_ok()))
```

Avoids the need for an allocation.
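To see why the suggestion matters: `String::from_utf8` takes an owned `Vec<u8>`, so `y.into()` copies the bytes of every row into a fresh allocation just to validate them, while `str::from_utf8` validates the borrowed slice in place. A minimal standalone sketch (not the PR's code, standard library only):

```rust
// Both APIs perform the same UTF-8 validation; the difference is ownership.
// `str::from_utf8` borrows the bytes (no copy), while `String::from_utf8`
// consumes a Vec<u8>, forcing an allocation per element in the map closure.
fn main() {
    let valid: &[u8] = b"Spark";
    let invalid: &[u8] = &[0xf0, 0x28, 0x8c, 0x28]; // malformed UTF-8 sequence

    // Allocation-free check on borrowed bytes.
    assert!(std::str::from_utf8(valid).is_ok());
    assert!(std::str::from_utf8(invalid).is_err());

    // Owned variant: `to_vec()` copies the slice first, which is the cost
    // the suggested change removes from the per-row path.
    assert!(String::from_utf8(valid.to_vec()).is_ok());
    assert!(String::from_utf8(invalid.to_vec()).is_err());

    println!("ok");
}
```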

@Jefffrey Jefffrey added this pull request to the merge queue Apr 19, 2026
@Jefffrey
Contributor

Thanks @kazantsev-maksim

Merged via the queue into apache:main with commit d8c9797 Apr 19, 2026
31 checks passed
@kazantsev-maksim kazantsev-maksim deleted the is_valid_utf8 branch April 19, 2026 12:39
@kazantsev-maksim
Contributor Author

Thanks for the review @Jefffrey
